Once you have identified relevant data sources and collected them into one cohesive dataset, it's easy to assume the data is ready to be fitted to a model. However, in the real world, data is messy. Some values may be missing, integers could be stored as strings, or calculated fields could be miscalculated. Data Wrangling is the process of identifying all of these errors and getting rid of them. This tutorial will cover the following learning objectives:
What is Data Wrangling?
How to Handle Missing Values
Data Wrangling with Pandas
What is Data Wrangling?
Summary
Data Wrangling, also known as Data Cleaning, is the process of transforming raw data into interpretable data.
In the context of Machine Learning, Data Wrangling is a powerful process that prepares your data for analysis through a process called Exploratory Data Analysis (EDA).
There are six steps in the Data Wrangling process:
Discovering. In this step, you analyze the schema (structure) of your dataset to understand what each feature represents, how it might impact the label, and what new features could be extracted from the existing ones.
Structuring. Most of the time, you will be working with nested structures (e.g., JSON strings, nested arrays) that need to be expanded to obtain the actual values. In this step, you extract the actual structure of the data from your dataset and normalize the features across all records. This involves ensuring all values have the same structure to keep your data consistent.
Cleaning. In this step, you identify the steps that need to be taken to make sure your features have the correct data types, nested structures have been extracted and aligned, null/missing values have been handled appropriately (see the next sub-section for details), and categorical variables are labeled appropriately. The sketch after this list illustrates the Structuring and Cleaning steps in pandas.
Enriching. Once you have a formalized pipeline to clean your data, it's wise to collect more data to provide your model with more clarity and reduce bias in its predictions.
Validating. In this step, you apply your organization's Data Quality standards to ensure categories are labeled consistently (e.g., all records representing "Walmart" have the same spelling), numeric values are within an acceptable range (e.g., no customers over the age of 100), and all numeric features are of a numeric data type (e.g., an integer column could be stored as strings in the source system).
Publishing. In this step, you export your cleaned and validated data to your organization's data repository (e.g., an S3 Bucket, Data Warehouse, or Data Lake) to be consumed by other models or reporting dashboards.
NOTE: Data Wrangling typically takes up 70-80% of a Machine Learning project's time, so it's arguably the most boring part. However, if this step is skipped, you won't have a job much longer.
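Below is a minimal sketch of the Structuring and Cleaning steps using pandas. The records and column names (order, customer_age) are hypothetical stand-ins for whatever your source system returns:

```python
import pandas as pd

# Hypothetical raw records with a nested "order" object and
# numbers stored as strings.
raw = [
    {"order": {"id": 1, "total": "19.99"}, "customer_age": "34"},
    {"order": {"id": 2, "total": "5.49"}, "customer_age": None},
]

# Structuring: expand the nested objects into flat, consistent columns.
df = pd.json_normalize(raw)

# Cleaning: cast string-typed numerics to proper numeric dtypes.
# Missing values (None) become NaN, to be handled in the next sub-section.
df["order.total"] = pd.to_numeric(df["order.total"])
df["customer_age"] = pd.to_numeric(df["customer_age"])

print(df.dtypes)
```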
Handling Missing Values
Summary
Missing values are a common occurrence in source systems. They can occur due to a skipped entry on a survey, an unknown value collected from an IoT sensor, or a value that couldn't be computed due to a lack of data in other fields.
When identifying missing values in a dataset, it can be tempting to simply filter them out or fill them with a default value. However, it's HIGHLY recommended that you first analyze your data model and see if you can retrieve the data from other source systems (e.g., perform a join to collect data from other tables in a data warehouse).
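Here's a minimal sketch of that recovery step in pandas, assuming hypothetical orders and customers tables that share a customer_id key:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["East", None, None],
})
customers = pd.DataFrame({
    "customer_id": [2, 3],
    "region": ["West", "South"],
})

# Count missing values per field before deciding how to handle them.
print(orders.isna().sum())

# Try to recover the missing regions from another table before
# resorting to imputation or omission.
merged = orders.merge(customers, on="customer_id", how="left",
                      suffixes=("", "_src"))
orders["region"] = merged["region"].fillna(merged["region_src"])
```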
If only 1% of the values in a particular field are missing, it's generally acceptable to omit those records from the dataset. On the other hand, if 99% of the values in a particular field are missing, it's generally acceptable to omit the field from the dataset.
It's generally acceptable to omit missing data from a dataset when the missing values are random in nature and aren't correlated with the other fields in any sense. However, if the records with missing values have other field values in common, it's worth investigating further using EDA.
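The sketch below wraps these rules of thumb in a reusable helper; the 1% and 70% thresholds come from this section and the note further down, and you should tune them to your own data:

```python
import pandas as pd

def prune_missing(df: pd.DataFrame,
                  row_threshold: float = 0.01,
                  col_threshold: float = 0.70) -> pd.DataFrame:
    """Drop mostly-empty fields, then drop records with rare gaps."""
    missing_frac = df.isna().mean()  # fraction of missing values per column

    # Omit fields where most of the values are missing.
    df = df.drop(columns=missing_frac[missing_frac >= col_threshold].index)

    # For fields with only a sliver of gaps, omit the affected records.
    sparse_cols = [c for c in df.columns
                   if 0 < missing_frac[c] <= row_threshold]
    return df.dropna(subset=sparse_cols)
```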
One of the most common ways ML Engineers fill missing values is through a process called Imputation. Imputation is the process of inferring the value of a missing data point by analyzing the points around it. This is done using several methods, including the following (illustrated in the sketches after this list):
Mean/Median Imputation: This is used to fill in missing values for numerical fields. In this method, you find the Mean or Median of the existing values in the field and use that value to fill in the missing values. This preserves the field's center, though it does shrink its variance.
Categorical Imputation: This is used to fill in missing values for non-numeric fields. In this method, you find the most common value in the field and use it to fill in the missing values. This should only be used when one category occurs notably more frequently than the others.
Interpolation: This is used to fill in missing values for numeric fields along a time-series axis. In this method, you plot the existing data on a line chart and fill each gap by connecting the points on either side of it with a straight line.
Supervised Imputation: This is used to fill in the missing values of any specified field using a supervised model. In this method, you train a model on your current dataset with the field you want to impute set as the label. The model then uses what it has learned to predict the missing values based on the features provided.
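A minimal sketch of the first three methods in pandas, using a hypothetical DataFrame with a numeric price, a categorical store, and an ordered temperature series:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, np.nan, 12.0, 11.0],
    "store": ["Walmart", "Target", np.nan, "Walmart"],
    "temperature": [60.0, np.nan, np.nan, 72.0],
})

# Mean/Median Imputation: fill numeric gaps with a central value.
df["price"] = df["price"].fillna(df["price"].median())

# Categorical Imputation: fill non-numeric gaps with the most common value.
df["store"] = df["store"].fillna(df["store"].mode()[0])

# Interpolation: connect the gaps along an ordered (time-series) axis.
df["temperature"] = df["temperature"].interpolate()  # 60, 64, 68, 72
```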
NOTE: Be careful when omitting records from a dataset. Although they may contain missing values, they could provide valuable insight to the model with the values stored in the other fields.
NOTE: A general rule of thumb is to omit features when 70% or more of the values are missing.
NOTE: The Supervised Imputation method should be used sparingly, as it adds complexity to your project and forces you to build and maintain an extra model for only a small portion of your data.
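If you do reach for Supervised Imputation, a sketch along these lines (using scikit-learn, with hypothetical column names, and assuming the feature columns themselves have no gaps) shows the general shape:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def impute_with_model(df: pd.DataFrame, target: str,
                      features: list[str]) -> pd.DataFrame:
    """Predict a field's missing values from the rows where it is known."""
    known = df[df[target].notna()]
    missing = df[df[target].isna()]
    if missing.empty:
        return df

    # Treat the field being imputed as the label, as described above.
    model = RandomForestRegressor(random_state=0)
    model.fit(known[features], known[target])

    df = df.copy()
    df.loc[df[target].isna(), target] = model.predict(missing[features])
    return df
```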
Data Wrangling with Pandas
Summary
Pandas is a popular Python library used by Data Scientists and ML Engineers to perform Data Wrangling and EDA.
Pandas is built on top of the NumPy library and therefore supports the same core data types.
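A quick illustration of that relationship, showing that DataFrame columns carry NumPy dtypes and expose their values as NumPy arrays:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

print(df.dtypes)           # int64 and float64 -- NumPy data types
print(df["a"].to_numpy())  # the column's values as a NumPy array
print(isinstance(df["a"].to_numpy(), np.ndarray))  # True
```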